Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset
Abstract
Training data is the most important factor in classification accuracy. In real-world applications, however, the data often have an imbalanced class distribution: most samples belong to the majority class and only a few to the minority class. If all the data are used for training in this case, the classifier tends to predict that most incoming data belong to the majority class. Selecting suitable training data is therefore important for classification under an imbalanced class distribution. In this paper, we propose cluster-based under-sampling approaches that select representative data as training data in order to improve classification accuracy for the minority class. The experimental results show that our cluster-based under-sampling approaches outperform the under-sampling techniques of previous studies.
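The abstract does not spell out how representative samples are chosen, so the following is only a minimal sketch of one possible cluster-based under-sampling scheme, not the authors' method. It assumes that the majority class is clustered with a small k-means routine and that each cluster contributes majority samples in proportion to its size, so that roughly as many majority samples as minority samples are kept. The names `kmeans` and `cluster_undersample`, the parameter `k`, and the proportional quota rule are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def kmeans(points, k, iters=20, seed=0):
    """Tiny Lloyd-style k-means on tuples of floats; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; an empty cluster keeps its center.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def cluster_undersample(X, y, majority_label, k=3, seed=0):
    """Return sorted indices of a reduced training set (hypothetical selection rule).

    All non-majority samples are kept. Majority samples are clustered, and each
    cluster contributes a share proportional to its size, so the number of kept
    majority samples is close to the number of minority samples.
    """
    rng = random.Random(seed)
    maj_idx = [i for i, label in enumerate(y) if label == majority_label]
    min_idx = [i for i, label in enumerate(y) if label != majority_label]
    k = min(k, len(maj_idx))
    assign = kmeans([X[i] for i in maj_idx], k, seed=seed)

    clusters = defaultdict(list)
    for i, c in zip(maj_idx, assign):
        clusters[c].append(i)

    kept = list(min_idx)
    for members in clusters.values():
        # Assumed quota rule: proportional to cluster size, at least one sample.
        quota = max(1, round(len(min_idx) * len(members) / len(maj_idx)))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return sorted(kept)


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic data: 90 majority samples around (0, 0), 10 minority samples around (3, 3).
    X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(90)]
    X += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(10)]
    y = [0] * 90 + [1] * 10
    kept = cluster_undersample(X, y, majority_label=0)
    print("majority kept:", sum(y[i] == 0 for i in kept))
    print("minority kept:", sum(y[i] == 1 for i in kept))
```

Sampling per cluster rather than uniformly at random is meant to keep the reduced majority set spread across the regions the majority class occupies. Whether this improves minority-class accuracy on a given dataset would have to be checked experimentally.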
Similar resources
Improving Imbalanced data classification accuracy by using Fuzzy Similarity Measure and subtractive clustering
Classification is one of the important parts of data mining and knowledge discovery. In most cases, the data used to train the clusters is not well distributed. This uneven distribution occurs when one class has a large number of samples while the number of samples in the other class is inherently low. In general, the methods of solving this kind of prob...
Addressing the Class Imbalance Problem in Medical Datasets
A well-balanced dataset is very important for creating a good prediction model. Medical datasets are often not balanced in their class labels. Most existing classification methods tend to perform poorly on minority class examples when the dataset is extremely imbalanced. This is because they aim to optimize the overall accuracy without considering the relative distribution of each class. In thi...
Imbalanced Multiclass Data Classification Using Ant Colony Optimization Algorithm
Class imbalance problems have drawn increasing interest lately because of the classification difficulty caused by imbalanced class distributions and the poor prediction performance for the minority class. This problem is particularly common in practice and can be observed in various disciplines, including fraud detection, anomaly detection, oil spillage detection, medical diagnosis, and facial recognition. Man...
Proposing a Novel Cost Sensitive Imbalanced Classification Method based on Hybrid of New Fuzzy Cost Assigning Approaches, Fuzzy Clustering and Evolutionary Algorithms
In this paper, a new hybrid methodology is introduced to design a cost-sensitive fuzzy rule-based classification system. A novel cost metric is proposed based on the combination of three different concepts: Entropy, Gini index and DKM criterion. In order to calculate the effective cost of patterns, a hybrid of fuzzy c-means clustering and particle swarm optimization algorithm is utilized. This ...
Classification in Imbalanced Datasets
In this thesis we study the classification task in the presence of class imbalanced data. This task arises in many applications when we are interested in the under-represented (minority) classes. Examples of such applications are related to fraud detection, medical diagnosis and monitoring, text categorization, risk management, information retrieval and filtering. Although there exist many stan...